Project Overview/Objectives:
Context:
There is huge demand for used cars in the Indian market today. As sales of new cars have slowed in the recent past, the pre-owned car market has continued to grow and is now larger than the new car market. Cars4U is a budding tech start-up that aims to gain a foothold in this market.
In 2018-19, while new car sales were recorded at 3.6 million units, around 4 million second-hand cars were bought and sold. The slowdown in new car sales could mean that demand is shifting towards the pre-owned market; in fact, some car owners replace their old cars with pre-owned cars instead of buying new ones. Unlike new cars, whose price and supply are fairly deterministic and managed by OEMs (Original Equipment Manufacturers), with dealership-level discounts coming into play only at the last stage of the customer journey, used cars carry huge uncertainty in both pricing and supply. Keeping this in mind, the pricing scheme of these used cars becomes important in order to grow in the market.
As a senior data scientist at Cars4U, you have to come up with a pricing model that can effectively predict the price of used cars and can help the business in devising profitable strategies using differential pricing. For example, if the business knows the market price, it will never sell anything below it.
Objective:
To explore and visualize the dataset, build a linear regression model to predict the prices of used cars, and generate a set of insights and recommendations that will help the business.
Data Dictionary (Feature : Explanation)
The data contains the different attributes of used cars sold in different locations. The detailed data dictionary is given below.
S.No. : Serial number
Name : Name of the car which includes brand name and model name
Location : Location in which the car is being sold or is available for purchase (cities)
Year : Manufacturing year of the car
Kilometers_Driven : The total kilometers driven in the car by the previous owner(s)
Fuel_Type : The type of fuel used by the car (Petrol, Diesel, Electric, CNG, LPG)
Transmission : The type of transmission used by the car (Automatic/Manual)
Owner : Type of ownership
Mileage : The standard mileage offered by the car company in kmpl or km/kg
Engine : The displacement volume of the engine in CC
Power : The maximum power of the engine in bhp
Seats : The number of seats in the car
New_Price : The price of a new car of the same model in INR Lakhs (1 Lakh INR = 100,000 INR)
Price : The price of the used car in INR Lakhs
Scoring Rubric & Best Practices for Notebook
The notebook should be well-documented, with inline comments explaining the functionality of code and markdown cells containing comments on the observations and insights.
The notebook should be run from start to finish in a sequential manner before submission.
It is preferable to remove all warnings and errors before submission. The notebook should be submitted as an HTML file (.html) and NOT as a notebook file (.ipynb).
# importing all of the required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
# loading the data set
pd.set_option('display.max_columns', None)
from google.colab import files
data_to_load = files.upload()
import io
df = pd.read_csv(io.BytesIO(data_to_load['used_cars_data.csv']))
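The `files.upload()` helper above is Colab-specific. When running the notebook locally, the same CSV can be read straight from disk; a minimal sketch (the helper name and file path are assumptions, not part of the original notebook):

```python
import os
import pandas as pd

def load_used_cars(path='used_cars_data.csv'):
    """Read the dataset from a local path when not running in Colab."""
    if not os.path.exists(path):
        raise FileNotFoundError(f'Expected the CSV at {path}')
    return pd.read_csv(path)
```

This keeps the rest of the notebook unchanged: `df = load_used_cars()` replaces the Colab upload cell.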
Objective:
We will start with EDA and Data Preprocessing to understand the data and prepare it for modelling.
# checking the shape of the data set
print(f'There are {df.shape[0]} rows and {df.shape[1]} columns.')
df.info()
df.dtypes
df.isnull().sum().sort_values(ascending=False)
# checking for duplicate entries in the data set
df.duplicated().sum()
# dropping duplicates
df.drop_duplicates(inplace=True)
Observations
np.random.seed(1)
df.sample(n=10)
df.describe(include='all').T
Observations
# Dropping the serial number column because it is not necessary for the model
df.drop(['S.No.'], axis=1, inplace=True)
# Creating a new column for car age
current_year = 2022
df['Car_Age'] = current_year - df['Year']
df.head()
# Converting year column to datetime datatype
df['Year'] = pd.to_datetime(df['Year'], format='%Y')
#checking our changes
df.head()
# Splitting the Name column into Car_Brand and Car_Model columns
# creating a column for car model
df['Car_Model'] = df.Name.str.split().str[0:3]
# joining the split list values in the car model column
df['Car_Model'] = df.Car_Model.apply(lambda x: ' '.join(x))
# creating a column for car brand
df['Car_Brand'] = df.Name.str.split().str[0]
# dropping the initial Name column
df.drop(['Name'], axis=1, inplace=True)
# taking a look at the changes made
df.head()
Next:
# removing the units from Mileage, Engine, and Power and changing the column datatypes to float
df['Power'] = df.Power.str.split().str.get(0).astype('float64')
df['Engine'] = df.Engine.str.split().str.get(0).astype('float64')
df['Mileage'] = df.Mileage.str.split().str.get(0).astype('float64')
# taking a look at the changes we've made to confirm it worked correctly
df.head()
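One caveat with the unit-stripping above: `astype('float64')` raises a `ValueError` if any cell's first token is non-numeric (for example a literal "null" before the unit, which scraped listings sometimes contain). A more defensive sketch using `pd.to_numeric(errors='coerce')`, which turns such tokens into NaN for the later imputation step (`strip_unit` is a hypothetical helper name):

```python
import pandas as pd

def strip_unit(series):
    """Keep the numeric token before the unit; non-numeric tokens become NaN."""
    return pd.to_numeric(series.str.split().str.get(0), errors='coerce')

# toy example including a malformed entry
power = pd.Series(['58.16 bhp', 'null bhp', '126.2 bhp'])
print(strip_unit(power))
```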
# converting Location, Fuel_Type, Transmission, and Owner_Type to categorical variables with one-hot encoding
df2 = pd.get_dummies(df, columns=['Location'])
df3 = pd.get_dummies(df2, columns=['Fuel_Type'])
df4 = pd.get_dummies(df3, columns=['Transmission'])
df5 = pd.get_dummies(df4, columns=['Owner_Type'])
#checking the changes made
df5.head()
# converting Car_Brand and Car_Model to categorical variables with one-hot encoding
df6 = pd.get_dummies(df5, columns=['Car_Brand'])
df7 = pd.get_dummies(df6, columns=['Car_Model'])
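As an aside, the chained `df2`…`df7` frames can be collapsed: `pd.get_dummies` accepts a list of columns in one call, and `drop_first=True` drops one level per category, which avoids perfectly collinear dummies in a linear model. A toy sketch:

```python
import pandas as pd

toy = pd.DataFrame({
    'Fuel_Type': ['Petrol', 'Diesel', 'Petrol'],
    'Transmission': ['Manual', 'Automatic', 'Manual'],
    'Price': [5.0, 7.5, 4.2],
})

# one call handles every categorical column; drop_first avoids
# the dummy-variable trap in linear regression
encoded = pd.get_dummies(toy, columns=['Fuel_Type', 'Transmission'], drop_first=True)
print(encoded.columns.tolist())
```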
We will start by looking at the individual columns. (Univariate Analysis)
# installing pandas profiling library
!pip install -U pandas_profiling
# importing pandas profiler
from pandas_profiling import ProfileReport
# generating a pandas profile report to visualize the data / get a further understanding
df.profile_report()
Observations about Columns:
Year Column
# Year Column
sns.histplot(data=df, x='Year', kde=True);
Kilometers_Driven Column
# Kilometers_Driven Column
sns.boxplot(data=df, x='Kilometers_Driven');
# Kilometers_Driven Column
sns.histplot(data=df, x='Kilometers_Driven', kde=False);
We see that the data is highly skewed and affected by the presence of outlier(s) lying very far out. Let's see if the box plot improves when we set showfliers to False.
# boxplot of Kilometers driven without outliers
sns.boxplot(data=df, x='Kilometers_Driven', showfliers = False);
We see this has helped significantly; it would be best to look further into the outlier(s) in this column so they don't negatively impact our model.
# Treating outlier(s) in Kilometers_Driven
# function to return lower and upper bounds of IQR
def outlier_treatment(column):
    Q1, Q3 = np.percentile(column, [25, 75])
    IQR = Q3 - Q1
    lower_range = Q1 - (1.5 * IQR)
    upper_range = Q3 + (1.5 * IQR)
    return lower_range, upper_range
# using the function to return the lower and upper IQR bounds for Kilometers_Driven
lowerbound,upperbound = outlier_treatment(df['Kilometers_Driven'])
print(lowerbound, upperbound)
# returning the rows where Kilometers_Driven contains outliers
df[(df['Kilometers_Driven'] < lowerbound) | (df['Kilometers_Driven'] > upperbound)]
# function to set outliers to the lower and upper bounds of IQR respectively
def treat_outliers(df, col):
    Q1 = df[col].quantile(0.25)  # lower quartile
    Q3 = df[col].quantile(0.75)  # upper quartile
    IQR = Q3 - Q1  # interquartile range
    lower_whisker = Q1 - 1.5 * IQR
    upper_whisker = Q3 + 1.5 * IQR
    # all the values < lower whisker will be set to the value of the lower whisker
    # all the values > upper whisker will be set to the value of the upper whisker
    df[col] = np.clip(df[col], lower_whisker, upper_whisker)
    return df
# treating outliers using the function
df = treat_outliers(df,'Kilometers_Driven')
# visualizing the column after outlier treatment
sns.boxplot(data=df,x='Kilometers_Driven')
plt.show()
We see our box plot looks normal without having to set showfliers to False, so this column looks good now. Let's plot a histogram to double-check our changes.
sns.histplot(data=df, x='Kilometers_Driven', kde=True);
We see a significant improvement from before.
Mileage Column
# Mileage Column
sns.histplot(data=df, x='Mileage', kde=True);
Engine Column
# Engine Column
sns.histplot(data=df, x='Engine', kde=True);
Power Column
# Power Column
sns.histplot(data=df, x='Power', kde=True);
Seats Column
# Seats Column
sns.histplot(data=df, x='Seats', kde=True);
New_Price Column
# New Price Column
sns.histplot(data=df, x='New_Price', kde=True);
Price Column
# Price Column
sns.histplot(data=df, x='Price', kde=True);
Car_Age Column
# Car_Age Column
sns.histplot(data=df, x='Car_Age', kde=True);
Car_Brand Column
# Car_Brand Column
sns.countplot(data=df, x='Car_Brand');
plt.xticks(rotation=90)
# calculating the number of unique models for each brand
models_per_brand = df.groupby('Car_Brand')['Car_Model'].nunique()
models_per_brand
Location Column
# Location Column
sns.countplot(data=df, x='Location');
plt.xticks(rotation=90)
Fuel_Type Column
# Fuel_Type Column
sns.countplot(data=df, x='Fuel_Type');
plt.xticks(rotation=90)
Transmission Column
# Transmission Column
sns.countplot(data=df, x='Transmission');
plt.xticks(rotation=90)
Owner_Type Column
# Owner_Type Column
sns.countplot(data=df, x='Owner_Type');
plt.xticks(rotation=90)
Now we will do a Bivariate analysis and observe relationships between different columns in the dataset.
# creating a pair plot to observe relationships between the columns
sns.pairplot(data=df)
# creating a heatmap to observe correlations between the numeric columns
sns.heatmap(data=df.corr(numeric_only=True), annot=True);
The heatmap and pairplots confirm the observations made about correlations made from the Pandas Profile Report. This will help us later with feature extraction.
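Beyond the full heatmap, ranking the numeric features by absolute correlation with Price makes the drop/keep candidates explicit. A toy sketch on synthetic data (the column names mirror the dataset, but the values are fabricated purely for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
toy = pd.DataFrame({'Engine': rng.normal(1500, 300, 50)})
toy['Power'] = toy['Engine'] * 0.08 + rng.normal(0, 5, 50)
toy['Price'] = toy['Power'] * 0.1 + rng.normal(0, 1, 50)

# rank features by absolute correlation with the target
corr_with_price = (
    toy.corr(numeric_only=True)['Price']
    .drop('Price')
    .abs()
    .sort_values(ascending=False)
)
print(corr_with_price)
```

On the real dataframe the same pattern (`df.corr(numeric_only=True)['Price']`) would list every numeric feature at once.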
Relationship Between Car_Brand and Price
sns.scatterplot(data=df, x='Car_Brand', y='Price', hue="Location");
plt.xticks(rotation=90);
plt.legend(bbox_to_anchor=(1.5, 1), borderaxespad=0)
sns.lineplot(data = df , x = 'Car_Brand' , y = 'Price');
plt.xticks(rotation=90);
Relationship between Price and Location
sns.scatterplot(data=df, x='Location', y='Price');
plt.xticks(rotation=90);
sns.lineplot(data = df , x = 'Location' , y = 'Price');
plt.xticks(rotation=90);
Relationship between Price and Transmission
sns.scatterplot(data=df, x='Transmission', y='Price');
plt.xticks(rotation=90);
sns.lineplot(data = df , x = 'Transmission' , y = 'Price');
plt.xticks(rotation=90);
Relationship between Price and Fuel_Type
sns.scatterplot(data=df, x='Fuel_Type', y='Price');
plt.xticks(rotation=90);
sns.lineplot(data = df , x = 'Fuel_Type' , y = 'Price');
plt.xticks(rotation=90);
Relationship between Price and Owner_Type
sns.scatterplot(data=df, x='Owner_Type', y='Price');
plt.xticks(rotation=90);
sns.lineplot(data = df , x = 'Owner_Type' , y = 'Price');
plt.xticks(rotation=90);
# taking another look at which columns have missing values
df.isnull().sum().sort_values(ascending=False)
Mileage Column
# extracting information of other variables where Mileage is null
df.loc[df['Mileage'].isnull()==True]
We see both rows with null Mileage have Automatic transmission, Electric fuel type, and First Owner_Type. We could impute the missing values based on this information together with Car_Model and Car_Brand. The distribution of Mileage is close to normal apart from a cluster of values near the lower bound, so imputing with a measure of central tendency is reasonable.
df.groupby(['Car_Model','Car_Brand'])[['Transmission','Owner_Type','Fuel_Type','Mileage']].median()
# Group-wise imputation by (Car_Brand, Car_Model) was attempted but kept erroring, e.g.:
# df7['Mileage'] = df7['Mileage'].fillna(df7.groupby(['Car_Brand', 'Car_Model'])['Mileage'].transform('median'))
# Instead, we impute values based on the overall mean or median of each column.
# The Mileage column is approximately normally distributed, so we will impute missing values with the mean
# Checking the mean of the Mileage column
df7['Mileage'].mean()
# Imputing the missing values in Mileage with the mean
df7['Mileage'].fillna(value=df7['Mileage'].mean(), inplace=True)
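For reference, group-wise imputation like the commented-out attempts above does work when `fillna` is fed the aligned output of `groupby(...).transform(...)`; any group that is entirely NaN then needs a global fallback. A minimal sketch on toy data:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Car_Brand': ['Maruti', 'Maruti', 'Honda', 'Honda'],
    'Mileage': [20.0, np.nan, 15.0, 17.0],
})

# fill each missing value with the median of its own brand;
# transform() returns a series aligned to the original index
group_median = toy.groupby('Car_Brand')['Mileage'].transform('median')
toy['Mileage'] = toy['Mileage'].fillna(group_median)

# fallback: any group that was entirely NaN gets the global median
toy['Mileage'] = toy['Mileage'].fillna(toy['Mileage'].median())
print(toy)
```

The same two-step pattern would apply to `df7` grouped by `['Car_Brand', 'Car_Model']`.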
Engine Column
# The Engine column's distribution is skewed so we will impute missing values with the median
# Checking the median of the Engine column
df7['Engine'].median()
# Imputing the missing values in Engine with the median
df7['Engine'].fillna(value=df7['Engine'].median(), inplace=True)
Seats Column
# The Seats column is normally distributed so we will impute missing values with the mean
# Checking the mean of the Seats column
df7['Seats'].mean()
# Imputing the missing values in Seats with the mean
df7['Seats'].fillna(value=df7['Seats'].mean(), inplace=True)
Power Column
# The Power column's distribution is skewed so we will impute missing values with the median
# Checking the median of the Power column
df7['Power'].median()
# Imputing the missing values in Power with the median
df7['Power'].fillna(value=df7['Power'].median(), inplace=True)
Price Column
# The Price column's distribution is skewed so we will impute missing values with the median
# Checking the median of the Price column
df7['Price'].median()
# Imputing the missing values in Price with the median
df7['Price'].fillna(value=df7['Price'].median(), inplace=True)
# Checking that the values have been imputed correctly
df7.isnull().sum().sort_values(ascending=False)
Everything looks good so we will now move onto feature extraction.
# first we will drop the Year column as it is highly correlated with Car_Age
df7.drop(['Year'], axis=1, inplace=True)
# next we will drop the Power column as it is highly correlated with Engine, New_Price, Price
df7.drop(['Power'], axis=1, inplace=True)
Building the Model
# dropping Price from X to prepare for modelling
X = df7.drop("Price", axis=1)
y = df7["Price"]
# splitting the dataset into training set (70%) and test set (30%)
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.30, random_state=1)
# fitting the model to the training set
regression_model = LinearRegression()
regression_model.fit(X_train, y_train)
# calculating the R-squared value of the training set
print(regression_model.score(X_train, y_train))
# calculating the RMSE value of the training set
print(np.sqrt(mean_squared_error(y_train, regression_model.predict(X_train))))
# calculating the R-squared value of the test set
print(regression_model.score(X_test, y_test))
# calculating the RMSE value of the test set
print(np.sqrt(mean_squared_error(y_test, regression_model.predict(X_test))))
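One caveat when judging these scores: with this many one-hot columns, plain R-squared creeps upward with every added predictor. Adjusted R-squared penalizes the predictor count; a small sketch of the standard formula (the example numbers are illustrative, not this model's actual values):

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared: penalizes R-squared for the number of predictors.

    r2 -- plain R-squared of the fit
    n  -- number of observations
    k  -- number of predictors (columns in X)
    """
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# e.g. an R-squared of 0.90 on 5000 rows with 250 predictors
print(adjusted_r2(0.90, 5000, 250))
```

Applied here, it would use `regression_model.score(X_train, y_train)`, `X_train.shape[0]`, and `X_train.shape[1]`.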
# taking a look at the coefficients and intercepts
equationinfo = pd.DataFrame(
np.append(regression_model.coef_, regression_model.intercept_),
index=X_train.columns.tolist() + ["Intercept"],
columns=["Coefficients"],
)
equationinfo
# exporting the linear regression equation
Equation = "Price = " + str(regression_model.intercept_)
print(Equation, end=" ")
for i in range(len(X_train.columns)):
    if i != len(X_train.columns) - 1:
        print('+ (', regression_model.coef_[i], ')*(', X_train.columns[i], ')', end=' ')
    else:
        print('+ (', regression_model.coef_[i], ')*(', X_train.columns[i], ')')
Since the Price column is skewed, we will replace it with the log of Price, rebuild the model, and compare the scores to the original model.
# creating a copy of the dataframe
df8 = df7.copy(deep=True)
# checking the new dataframe
df8.head()
# creating a new column for the log of the Price column
df8['PriceLog'] = np.log(df8['Price'])
# dropping the original price column from df8
df8.drop(['Price'], axis=1, inplace=True)
Now we will rebuild the model using the new dataframe with the PriceLog column.
# dropping PriceLog from X to prepare for modelling
X2 = df8.drop("PriceLog", axis=1)
y2 = df8["PriceLog"]
# splitting the dataset into training set (70%) and test set (30%)
X2_train, X2_test, y2_train, y2_test = train_test_split( X2, y2, test_size=0.30, random_state=1)
# fitting the model to the training set
regression_model2 = LinearRegression()
regression_model2.fit(X2_train, y2_train)
# calculating the R-squared value of the training set
print(regression_model2.score(X2_train, y2_train))
# calculating the RMSE value of the training set
print(np.sqrt(mean_squared_error(y2_train, regression_model2.predict(X2_train))))
# calculating the R-squared value of the test set
print(regression_model2.score(X2_test, y2_test))
# calculating the RMSE value of the test set
print(np.sqrt(mean_squared_error(y2_test, regression_model2.predict(X2_test))))
We see that the R-squared values have improved significantly for both the training and test sets in the model using PriceLog. The gap between the training and test R-squared values has also narrowed (roughly 16 points in the original model vs. roughly 9 with PriceLog).
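One caveat before comparing further: the RMSE printed for this model is in log-Lakhs, so it is not directly comparable with the first model's RMSE. Exponentiating the predictions puts both models on the same scale; a self-contained sketch on synthetic data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
X = rng.uniform(1, 10, size=(200, 1))
price = np.exp(0.3 * X[:, 0] + rng.normal(0, 0.1, 200))  # log-linear target

# fit on the log of the target, as done with PriceLog
model = LinearRegression().fit(X, np.log(price))

# predictions come out in log units; exponentiate before scoring
pred_price = np.exp(model.predict(X))
rmse_original_scale = np.sqrt(mean_squared_error(price, pred_price))
print(rmse_original_scale)
```

The same `np.exp(...)` step applied to `regression_model2.predict(X2_test)` would give an RMSE in INR Lakhs, directly comparable with the first model's.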
Let's now look at the coefficients and intercept, and get the linear regression equation of the new model with PriceLog.
# taking a look at the coefficients and intercepts
equationinfo2 = pd.DataFrame(
np.append(regression_model2.coef_, regression_model2.intercept_),
index=X2_train.columns.tolist() + ["Intercept"],
columns=["Coefficients"],
)
equationinfo2
# exporting the linear regression equation
Equation = "PriceLog = " + str(regression_model2.intercept_)
print(Equation, end=" ")
for i in range(len(X2_train.columns)):
    if i != len(X2_train.columns) - 1:
        print('+ (', regression_model2.coef_[i], ')*(', X2_train.columns[i], ')', end=' ')
    else:
        print('+ (', regression_model2.coef_[i], ')*(', X2_train.columns[i], ')')